The objective of this article is to explore and model the dataset of homicides in the state of Santa Catarina. After request, the state’s security bureau of Santa Catarina gave us the data. Ian Fukushima, my colleague, inspired me to make the maps and analysis of this data, I’m grateful for his support.
Let’s start by loading packages:
#preliminaries -----------------------------------------
library(knitr) #for kable, stylized table in r markdown
library(tidyverse) #for pipe
library(dplyr) #select() function
library(sp)
library(spdep)
library(rgdal)
library(Amelia) #missmap()
library(ggmap)
library(rgeos) #gUnaryUnion
library(xts) #extended time series
library(magick) # to make gifs
library(randomForest)
library(gstat) #idw function
library(tmap) #to use tmap function
library(randomForest)
library(grid) #to plot two tmaps together
library(raster) #raster()
Include some paths to save images. You must change those paths for your machine
path1 = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/"
path_root = "/home/william/Dropbox/working/projects_git/Crime-SC/"
path_data = "/home/william/Dropbox/working/projects_git/Crime-SC/Dados"
load.csv() will load the data. Ian used Google Maps’ API to find the geographic coordinates. The raw data that came only with the address of each homicide incident.
#prepare base----------------------------------------
#load base
crime = read.csv(file=paste0(path_data,"/df_sc_crime_geocoded.csv"), header=TRUE, sep=",", fileEncoding="latin1")
crime %>% class
## [1] "data.frame"
The data have 10745 incidents of homicide.
crime %>% dim()
## [1] 10745 25
The following are the variables. We are going to use only "Data", "lat","long".
crime %>% names()
## [1] "X" "Crime"
## [3] "Data" "Hora"
## [5] "ID" "Endereço"
## [7] "Município" "Bairro"
## [9] "Tipo.Logradouro" "Logradouro"
## [11] "Número" "Tipo.de.local"
## [13] "Localidade" "PosiTime"
## [15] "lat" "long"
## [17] "accuracy" "formatted_address"
## [19] "address_type" "status"
## [21] "index" "MOTIVAÇÃO"
## [23] "VIOLÊNCIA.DOMÉSTICA" "MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER"
## [25] "ESTUPRO..ANTECEDENTE"
Here is the summary of our data. We are going to use "Data", "lat","long". But variables as "Municipio","Número","Endereço","Bairro" are important when using Google’s API for geocoding.
summary(crime)
## X Crime
## Min. : 1 HOMICÍDIO :9162
## 1st Qu.: 2687 LATROCÍNIO : 665
## Median : 5373 PESSOA MORTA POR POLICIAL MILITAR EM SERVIÇO : 619
## Mean : 5373 LESÃO CORPORAL SEGUIDA DE MORTE : 170
## 3rd Qu.: 8059 PESSOA MORTA POR POLICIAL CIVIL EM SERVIÇO : 63
## Max. :10745 PESSOA MORTA POR POLICIAL MILITAR FORA DE SERVIÇO: 50
## (Other) : 16
## Data Hora ID
## 6/12/2015: 12 21:00 : 638 Min. : 1
## 13/3/2010: 11 23:00 : 604 1st Qu.: 2687
## 13/5/2018: 11 22:00 : 579 Median : 5373
## 31/3/2018: 11 20:00 : 530 Mean : 5373
## 1/1/2013 : 10 0:00 : 512 3rd Qu.: 8059
## 11/7/2010: 10 19:00 : 438 Max. :10745
## (Other) :10680 (Other):7444
## Endereço
## , JOINVILLE, Santa Catarina, BRASIL : 50
## , FLORIANÓPOLIS, Santa Catarina, BRASIL : 22
## RUA SEIS DE JANEIRO , JOINVILLE, Santa Catarina, BRASIL: 21
## , SÃO JOSÉ, Santa Catarina, BRASIL : 15
## , ITAJAÍ, Santa Catarina, BRASIL : 13
## , CRICIÚMA, Santa Catarina, BRASIL : 12
## (Other) :10612
## Município Bairro Tipo.Logradouro
## JOINVILLE :1133 CENTRO : 990 RUA :7234
## FLORIANÓPOLIS:1063 INTERIOR : 321 ESTRADA :1105
## ITAJAÍ : 561 : 216 RODOVIA : 701
## CHAPECÓ : 487 RURAL : 193 AVENIDA : 642
## SÃO JOSÉ : 467 MONTE ALEGRE : 161 : 311
## CRICIÚMA : 380 PARANAGUAMIRIM: 142 SERVIDÃO: 282
## (Other) :6654 (Other) :8722 (Other) : 470
## Logradouro Número Tipo.de.local
## : 322 :7686 VIA PÚBLICA :5728
## GERAL : 185 20 : 21 RESIDÊNCIA :2956
## BR 101 : 71 200 : 20 OUTRO : 848
## BR 470 : 35 30 : 15 ESTABELECIMENTO COMERCIAL: 489
## SANTA CATARINA: 30 55 : 14 BARES E SIMILARES : 452
## GETÚLIO VARGAS: 28 12 : 13 ESTABELECIMENTO PENAL : 122
## (Other) :10074 (Other):2976 (Other) : 150
## Localidade PosiTime
## ZONA URBANA :9851 2014-08-31 03:00:00: 6
## ZONA RURAL : 882 2014-09-06 22:00:00: 6
## NÃO INFORMADA : 4 ZONA URBANA : 6
## \\"CANTINA PER TUTI\\",","VIA PÚBLICA: 1 2014-03-20 19:00:00: 5
## \\"E\\",","VIA PÚBLICA : 1 2015-02-26 04:00:00: 5
## 406, \\"G\\",","VIA PÚBLICA : 1 2016-06-12 06:00:00: 5
## (Other) : 5 (Other) :10712
## lat long accuracy
## -28.6010292: 91 -49.151029 : 91 route :5971
## -26.2834213: 73 -48.8452269: 73 street_address:2136
## -26.6140628: 41 -53.7265491: 41 locality : 790
## -27.6140791: 27 -48.6370861: 27 establishment : 781
## -26.0713285: 25 -48.6177384: 25 premise : 371
## (Other) :10471 (Other) :10471 (Other) : 679
## NA's : 17 NA's : 17 NA's : 17
## formatted_address
## Estrada Geral, Santa Catarina, Brazil : 91
## Joinville - State of Santa Catarina, Brazil : 74
## BR-101, Santa Catarina, Brazil : 41
## Estrada de Linha, Paraíso - SC, 89906-000, Brazil : 41
## São José, Santa Catarina - Barreiros, São José - State of Santa Catarina, Brazil: 27
## (Other) :10454
## NA's : 17
## address_type
## route :5971
## street_address :2136
## locality,political : 790
## premise : 371
## administrative_area_level_2,political: 321
## (Other) :1139
## NA's : 17
## status
## OK :10720
## Error - NA - Verifify : 17
## route : 3
## Chapecó, State of Santa Catarina, Brazil : 1
## clothing_store,establishment,home_goods_store,point_of_interest,store: 1
## establishment,point_of_interest : 1
## (Other) : 2
## index MOTIVAÇÃO VIOLÊNCIA.DOMÉSTICA
## OK : 5 NÃO INFORMADA :3480 NÃO :10178
## 1 : 1 DESAVENÇA :2853 NÃO INFORMADO: 1
## 10 : 1 TRÁFICO DE DROGAS:1726 SIM : 566
## 100 : 1 PASSIONAL : 846
## 1000 : 1 AÇÃO POLICIAL : 745
## 10000 : 1 ROUBO : 655
## (Other):10735 (Other) : 440
## MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER ESTUPRO..ANTECEDENTE
## NÃO :4489 NÃO :10559
## NÃO INFORMADO:6232 NÃO INFORMADO: 160
## SIM : 24 SIM : 26
##
##
##
##
kable(crime[1:3,], caption="original data of homicides") #shows stylized table of data frame
| X | Crime | Data | Hora | ID | Endereço | Município | Bairro | Tipo.Logradouro | Logradouro | Número | Tipo.de.local | Localidade | PosiTime | lat | long | accuracy | formatted_address | address_type | status | index | MOTIVAÇÃO | VIOLÊNCIA.DOMÉSTICA | MENOSP.DISCRIM.CONDIÇÃO.DE.MULHER | ESTUPRO..ANTECEDENTE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HOMICÍDIO | 1/1/2008 | 0:00 | 1 | RUA MONTEVIDÉU , CHAPECÓ, Santa Catarina, BRASIL | CHAPECÓ | FORTES | RUA | MONTEVIDÉU | VIA PÚBLICA | ZONA URBANA | 2008-01-01 00:00:00 | -27.0996274 | -52.5998438 | route | R. Montevidéu, Chapecó - SC, Brazil | route | OK | 1 | TRÁFICO DE DROGAS | NÃO | NÃO INFORMADO | NÃO | |
| 2 | HOMICÍDIO | 1/1/2008 | 0:00 | 2 | RUA PEGASUS 187, JOINVILLE, Santa Catarina, BRASIL | JOINVILLE | JARDIM PARAÍSO | RUA | PEGASUS | 187 | OUTRO | ZONA URBANA | 2008-01-01 00:00:00 | -26.2144603 | -48.8141942 | route | R. Pegasus - Jardim Paraiso, Joinville - SC, 89240-000, Brazil | route | OK | 2 | TRÁFICO DE DROGAS | NÃO | NÃO INFORMADO | NÃO |
| 3 | HOMICÍDIO | 1/1/2008 | 6:00 | 3 | AVENIDA AMÂNDIO CABRAL , BALNEÁRIO BARRA DO SUL, Santa Catarina, BRASIL | BALNEÁRIO BARRA DO SUL | CENTRO | AVENIDA | AMÂNDIO CABRAL | VIA PÚBLICA | ZONA URBANA | 2008-01-01 06:00:00 | -26.4594161 | -48.6030489 | route | Av. Amândio Cabral, Balneário Barra do Sul - SC, 89247-000, Brazil | route | OK | 3 | TRÁFICO DE DROGAS | NÃO | NÃO INFORMADO | NÃO |
Many variables came in factor format. We transform all variables that are factor into character.
#transform all variables that are factor into character
crime2 = crime
len1 = length(names(crime2))
ii = 1
for (ii in 1:len1) {
vv = crime[,ii]
if (is.factor(vv)) {
vv1 <- as.character(vv)
crime2[,ii] = vv1
}
}
#X and ID show the same thing
identical(crime2$X,crime2$ID)
## [1] TRUE
#years and days, and hours of crime
crime2$Data %>% class
## [1] "character"
#we can aggregate the data into months or trimesters
Take a look at the types of homicide. We’ll consider all of them as just homicide.
kable(crime2$Crime %>% table)
| . | Freq |
|---|---|
| HOMICÍDIO | 9162 |
| INFANTICÍDIO | 8 |
| LATROCÍNIO | 665 |
| LESÃO CORPORAL SEGUIDA DE MORTE | 170 |
| PESSOA MORTA POR POLICIAL CIVIL EM SERVIÇO | 63 |
| PESSOA MORTA POR POLICIAL CIVIL FORA DE SERVIÇO | 8 |
| PESSOA MORTA POR POLICIAL MILITAR EM SERVIÇO | 619 |
| PESSOA MORTA POR POLICIAL MILITAR FORA DE SERVIÇO | 50 |
There are only some lines that are missing, but for variables that we don’t use.
#map of missing in crime2 from Amelia
missmap(crime2)
#transform lat and long into numeric
crime2$lat=as.numeric(crime2$lat)
crime2$long=as.numeric(crime2$long)
#there are a few missings, 25 and 19
is.na(crime2$lat) %>% sum
## [1] 25
is.na(crime2$long) %>% sum
## [1] 19
#drop places with missing lat and long
mis_lat = is.na(crime2$lat)
mis_lon = is.na(crime2$lat)
crime3 = crime2[!(mis_lat|mis_lon),]
crime3 %>% dim()
## [1] 10720 25
#we drop 25 observations
There are some outliers for latitude, for example, -27027.
summary(crime3$lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -27027.00 -27.59 -27.05 -34.48 -26.74 45.14
#there are some outliers in lat and long
#proportion of points that does not make sense
summary(crime3$long)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -119.70 -49.58 -48.83 -49.40 -48.65 35.37
len2 = dim(crime3)[1]
len2
## [1] 10720
#10720 number of lines in data
sum(crime3$long < -54)
## [1] 8
#8 number of outliers
sum(crime3$long < -54)/len2
## [1] 0.0007462687
#0.0007462687
#-54 long is more than the most western frontiers of Santa Catarina
#-53.8376609 most west of SC
#-47.2330741 most east of SC
sum(crime3$long > -47)
## [1] 63
#latitude explore outliers
summary(crime3$lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -27027.00 -27.59 -27.05 -34.48 -26.74 45.14
#look at google maps to see which values makes sense
# -25.3345494 most north of SC
# -29.5 most south of SC
We use function get_map() from ggmap to get a “screen shot” from Google maps.
# maps ------------------------------------
#google maps
map = get_map(location=c(-53.9, -29.5 ,-47,-25.33),maptype = 'roadmap')
plot(map)
#transform crime3 into spatial points data.frame
coordinates(crime3) = ~long+lat
plot(crime3$long,crime3$lat )
#plot of coordinates doesn't make sense because of outliers
plot(crime3)
Those strange results are important to be shown. Those are some examples of problems that we face using real world data.
crime3 %>% class
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
The projection of a spatial object is important. Below we see that crime3 doesn’t have a projection.
crime3@proj4string #there is no projection
## CRS arguments: NA
Here we give a projection for crime3
crime3@proj4string = CRS("+proj=longlat +datum=WGS84")
plot(crime3) #there are some outliers in the west, it doesn't make sense
points_occur = data.frame(lon=coordinates(crime3)[,1],
lat=coordinates(crime3)[,2])
(crime_sc1=ggmap(map, ylab = "Latitude", xlab = "Longitude") +
#geom_polygon(aes(x=long, y=lat, group=group),
# fill='grey', size=.2, color='blue', data=, alpha=0)+
geom_point(data=points_occur, aes(x=lon, y=lat), size=.5, color='red', alpha=.5)
)
#save image to disk
ggsave("crime_sc1.jpg", width = 20, height = 20, units = "cm",dpi = 600)
#import data of municipalities:
munic = readOGR("~/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer="municip07")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/william/Dropbox/working/RAW_DATA/shpbrasil/municip07.shp", layer: "municip07"
## with 5561 features
## It has 3 fields
#pick up only municipalities that are in the state of Santa Catarina
state = substr(munic$CODIGOIB1,1,2)
munic=munic[state=='42',]
munic@proj4string = CRS("+proj=longlat +datum=WGS84")
plot(munic)
#plot map with municipaility borders and points of crime and google maps
munic_fortify = fortify(munic, region="CODIGOIB1")
munic_fortify %>% class
## [1] "data.frame"
#we first transform spatial polygons into data frame
(crime_sc2=ggmap(map, ylab = "Latitude", xlab = "Longitude") +
geom_polygon(data=munic_fortify, color='black', alpha=0,
aes(y=lat, x=long, group=group, fill='grey'))+
geom_point(data=points_occur, aes(x=lon, y=lat), size=.75, color='red', alpha=.5)
)
#save image
ggsave("crime_sc2.jpg", width = 20, height = 20, units = "cm",dpi = 600)
#we need to exclude points outside santa catarina's borders
#get outer border of Santa Catarina
munic_dissolve = gUnaryUnion(munic)
plot(munic_dissolve)
Note that munic_dissolve is a spatial polygon, then we use it to select values of crime3 which is spatial points. So this is a way to crop values in a spatial points.
#exclude outliers, we solve the problem
crime4 = crime3[munic_dissolve,]
plot(crime4)
Now that we excluded all outliers we have a recognizable map.
#create hexagon polygons --------------------------------------------------
#code from:
#http://strimas.com/spatial/hexagonal-grids/#f1
size = 0.3
set.seed(123)
hex_points = spsample(munic_dissolve, type = "hexagonal", cellsize = size)
hex_points %>% class #spatial points
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"
hex_points %>% plot #just make points in a hexagon fashion
hex_points %>% length #there are 109 points
## [1] 109
hex_grid = HexPoints2SpatialPolygons(hex_points, dx = size)
hex_grid %>% class #from spatial points to spatial polygons
## [1] "SpatialPolygons"
## attr(,"package")
## [1] "sp"
plot(hex_grid)
#number of hexagons in map
len_hex = length(hex_grid) ; len_hex
## [1] 109
plot(munic_dissolve, col = "grey50", bg = "light blue", axes = TRUE)
plot(hex_points, col = "black", pch = 20, cex = 0.5, add = T)
plot(hex_grid, border = "orange", add = T)
#to save image
jpeg("hexagon_sc.jpg")
plot(munic_dissolve, col = "grey50", bg = "light blue", axes = TRUE)
plot(hex_points, col = "black", pch = 20, cex = 0.5, add = T)
plot(hex_grid, border = "orange", add = T)
dev.off()
## png
## 2
hex_fortify = fortify(hex_grid)
(hex_sc2=ggmap(map, ylab = "Latitude", xlab = "Longitude") +
geom_polygon(data=hex_fortify, color='black', alpha=0,
aes(y=lat, x=long, group=group, fill='grey'))+
geom_point(data=points_occur, aes(x=lon, y=lat), size=.75, color='red', alpha=.5)
)
Note that there are some points that are inside any hexagon. Those points will not be accounted in the final data frame. This is a drawback in using hexagon polygons.
ggsave("hex_sc2.jpg", width = 20, height = 20, units = "cm",dpi = 600)
#another view of the data
plot(hex_grid)
plot(crime4, col="red" , add=TRUE)
#prepare spatial polygons data frame
crime4$count = 1
names(crime4)[24] #"count"
## [1] "count"
sp1 = crime4[,24] #take just the "count" variable
sp1 %>% class
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
res = over(hex_grid, sp1 , fn=sum) #count nº of homicides in each hexagon
res %>% dim
## [1] 109 1
is.na(res$count) %>% sum
## [1] 5
res$count[is.na(res$count)] = 0 #hexagons with no crime have 0 crimes
hex_sp = SpatialPolygonsDataFrame(hex_grid, data=data.frame(ID=1:length(hex_grid)), match.ID = FALSE)
hex_sp$crime = res$count
#plot of crime for all periods
spplot(hex_sp,c("crime"))
This map shows for each hexagon how many homicides happened. The numbers shown are for all periods.
#separate for each month ------------------------------------
#the previous maps was built with all periods
#now we need to separate the base for different months
crime5 = crime4 #rename to manipulate
crime5$Data = as.Date(crime5$Data,format="%d/%m/%Y")
summary(crime5$Data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2008-01-01" "2011-02-09" "2014-03-16" "2014-01-01" "2016-12-06" "2019-08-22"
#data max 22/08/2019 almost at the end of the period
#but I expect that this month the prediction will be bad
crime5$Data %>% class
## [1] "Date"
crime5$Data[sample(1:10000,5)]
## [1] "2017-08-21" "2011-07-06" "2010-03-14" "2018-03-03" "2011-12-27"
#create only month data
crime5$Date2 = paste0(substr(as.character(crime5$Data),1,7),"-01")
crime5$Date2[sample(1:10000,5)] %>% kable
| x |
|---|
| 2013-08-01 |
| 2015-10-01 |
| 2018-09-01 |
| 2011-04-01 |
| 2014-01-01 |
crime5$Date2 %>% class
## [1] "character"
crime5$Date2 = as.Date(crime5$Date2,"%Y-%m-%d")
crime5$Date2 %>% class
## [1] "Date"
kable(crime5$Date2[sample(1:10000,5)])
| x |
|---|
| 2018-01-01 |
| 2018-01-01 |
| 2011-05-01 |
| 2015-02-01 |
| 2011-01-01 |
#there are specially few homicides in august 2019, because it is not the full month
crime_c1 = as.numeric(table(crime5$Date2))
timeid = unique(crime5$Date2) #take only unique dates
timeid[1:5]
## [1] "2008-01-01" "2008-02-01" "2008-03-01" "2008-04-01" "2008-05-01"
length(timeid) #there are 140 months
## [1] 140
crime_ts = xts(crime_c1, order.by=timeid)
jpeg("time_crime.jpg", width = 800, height = 300)
plot(crime_ts)
dev.off()
## png
## 2
#idea: build a spatial VAR model with time series of crimes
#need to build a data frame with each line a hexagon-region for each month
#create neighboors link
nb1 = poly2nb(hex_sp, queen=T)
#connections between hexagon-regions
plot(nb1,coordinates(hex_sp))
plot(hex_sp, add=TRUE)
This map shows the hexagon polygons and the links. Those links are used to determine the neighborhood and spatially lagged values.
jpeg("links.jpg", width = 1000, height = 1000)
plot(nb1,coordinates(hex_sp))
plot(hex_sp, add=TRUE)
dev.off()
## png
## 2
W = nb2listw(nb1, glist=NULL, style ="W") #normlize weights to 1
#create list with spatial data frames, with each month
df_crime = data.frame() #general data frame for models
list_poly = list() #list to input spatial polygon data frame
len_hex = length(hex_grid) #variable that contain length of spatial dataframe
ii = 1 #just a test
for (ii in 1:length(timeid)) {
var1 = crime5$Date2 == timeid[ii] #obs of interest
crime6 = crime5[var1,]
res = over(hex_grid, crime6[,24] , fn=sum)
res$count[is.na(res$count)] = 0
hex_sp = SpatialPolygonsDataFrame(hex_grid, data=data.frame(ID=1:len_hex), match.ID = FALSE)
hex_sp$crime = res$count
hex_sp$period = ii
hex_sp$date = timeid[ii]
hex_sp$lag_crime = lag.listw(W,hex_sp$crime)
hex_sp$long = coordinates(hex_sp)[,1]
hex_sp$lat = coordinates(hex_sp)[,2]
list_poly[[ii]] = hex_sp #input the spatial polygons in the list
names(list_poly)[ii] = ii #input name in the list for the spatial polygon
df_crime = rbind(df_crime,hex_sp@data)
#print(ii) #just to see progress
}
total1=length(timeid)*len_hex; total1
## [1] 15260
df_crime %>% dim
## [1] 15260 7
#they should match
#df_crime %>% View()
df_crime[sample(1:total1,5),] %>% kable
| ID | crime | period | date | lag_crime | long | lat | |
|---|---|---|---|---|---|---|---|
| 9642 | 50 | 0 | 89 | 2015-05-01 | 0.5000000 | -53.63526 | -27.05875 |
| 13961 | 9 | 0 | 129 | 2018-09-01 | 0.0000000 | -49.88526 | -28.35779 |
| 9982 | 63 | 0 | 92 | 2015-08-01 | 0.1666667 | -49.73526 | -27.05875 |
| 12499 | 73 | 0 | 115 | 2017-07-01 | 0.5000000 | -51.68526 | -26.79894 |
| 2980 | 37 | 8 | 28 | 2010-04-01 | 1.3333333 | -48.53526 | -27.57836 |
missmap(df_crime)
This is a Moran’s I test for homicides for one month. Note the p-value, there is spatial correlation in homicide.
#moran's I
var2 = list_poly[[1]]$crime
moran.test(var2,W)
##
## Moran I test under randomisation
##
## data: var2
## weights: W
##
## Moran I statistic standard deviate = 2.4378, p-value = 0.007388
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.118433931 -0.009259259 0.002743691
#there is spatial correlation in homicide
The following function creates a ggplot for nº of homicides and the spatial lag of homicides.
#plot graph with vec1 and spatial lag of vec1
scatLag = function(vec1,W){
vec1lag = lag.listw(W,vec1)
h2.df = as.data.frame(cbind(vec1,vec1lag))
plot = #scatterplot x=vec, y=lagged vec1
ggplot(data=h2.df, aes(y=vec1lag, x=vec1)) +
geom_point(size=.5,alpha=0.5)+
coord_equal()+
ggtitle("scatterplot x=homicide,y=lagged homicide")+
xlab("homicide") + ylab("lagged homicide")
plot
}
scatLag(var2,W)
Here we plot how much crime there is and plot them in a GIF. This GIF shows how much crime changes in time.
# maps ang gifs -----------------------------------------
#idea: build maps of how much crime is there each regions
#build maps for each month then plot them in a gif
hex_fortify = fortify(hex_grid)
path_images = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/images/"
The following loop creates all images for GIF.
ii = 1
seq1 = seq(1,length(timeid), by=6)
for (ii in seq1) {
spdf = list_poly[[ii]][,"crime"]
spdf$piece = 1:len_hex
crime_points = data.frame(coordinates(spdf), as.data.frame(spdf$crime))
names(crime_points) = c('lon', 'lat', 'crime')
crime_hex = ggmap(map, ylab = "Latitude", xlab = "Longitude") +
geom_polygon(data=hex_fortify, color='black', alpha=0, size=.1,
aes(y=lat, x=long, group=group, fill=''))+
geom_point(data=crime_points, aes(x=lon, y=lat, color=crime), size=6, alpha=.9)+
scale_colour_distiller(palette = "Spectral", limits=c(0,20))+
ggtitle(paste('crime', timeid[ii]))
ggsave(paste0(path_images,"crime_",1000+ii,".jpg"), width = 20, height = 20, units = "cm",dpi = 70)
#print(ii)
}
This code chunck creates the GIF.
getwd()
## [1] "/home/william/Dropbox/working/projects_git/Crime-SC/tests william"
list.files(path = path_images, pattern = "*.jpg", full.names = T) %>%
map(image_read) %>% # reads each path file
image_join() %>% # joins image
image_animate(fps=1) %>% # animates, can opt for number of loops
image_write("crime_gif.gif") # write to current dir
This GIF show how number of homicides change in each hexagon. The period are changing in 6 months.
#GIF with raster ----------------------------------------------------
#build another type of GIF now with IDW function and raster format
#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 90000)
gridded(grid) = TRUE
plot(grid)
We can’t see in this map. But there are 900000 squares in it. This is used in making a IDW (inverse distance weight) interpolation to create continuous maps.
grid %>% class
## [1] "SpatialPixels"
## attr(,"package")
## [1] "sp"
#this procedure is important to make the raster map
#idea: make all the analysis here using a square grid instead hexagonal
path_images3 = "/home/william/Dropbox/working/projects_git/Crime-SC/tests william/images3"
ii = 1
seq1 = seq(1,length(timeid), by=6)
for (ii in seq1) {
spdf = list_poly[[ii]][,"crime"]
#build raster with idw
v = as.numeric(spdf$crime)
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid)
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) +
tm_raster("crime", style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
n = 4, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title=timeid[ii],
title.size=.8,
title.position = c(0.25,.3))
tmap_save(tm = tm1,
filename = paste0(path_images3,"/crime_",1000+ii,".jpg"),
width = 3, height = 3)
#print(ii) #just to follow progress
}
#to build GIF for munic raster
getwd()
list.files(path = path_images3, pattern = "*.jpg", full.names = T) %>%
map(image_read) %>% # reads each path file
image_join() %>% # joins image
image_animate(fps=1) %>% # animates, can opt for number of loops
image_write("crime_gif_raster.gif") # write to current dir
#prepare data frame for modelling --------------------------------
#build lag variables and prepare data frame for modelling
library(dplyr)
df_temp1 = df_crime[,c("period", "crime", "lag_crime")] #select variables of interest
df_temp1[sample(1:total1,10),] %>% kable()
| period | crime | lag_crime | |
|---|---|---|---|
| 555 | 6 | 0 | 0.2000000 |
| 14183 | 131 | 0 | 0.2500000 |
| 4469 | 41 | 0 | 2.3333333 |
| 15180 | 140 | 0 | 0.0000000 |
| 9359 | 86 | 0 | 0.0000000 |
| 10784 | 99 | 0 | 0.2500000 |
| 10730 | 99 | 0 | 0.8333333 |
| 7789 | 72 | 0 | 0.5000000 |
| 9991 | 92 | 0 | 0.0000000 |
| 9097 | 84 | 0 | 0.5000000 |
#the following loop will make temporal lags as new features
jj = 1
df_crime2 = df_crime
for (jj in 1:12) {
#create columns with spatial and temporal lags of rows
#just a filler with NAs for rbind later
toPut = data.frame(matrix(NA,nrow=len_hex*jj,ncol=dim(df_temp1)[2]))
names(toPut) = names(df_temp1)
df_temp2 = df_temp1[1:(total1-len_hex*jj),]
df_temp3 = rbind(toPut,df_temp2)
df_temp4 = df_temp3
names(df_temp4) = c(paste0("period_",jj), paste0("crime_",jj),
paste0("lag_crime_",jj))
df_crime2 = cbind(df_crime2, df_temp4)
#df_crime2 %>% dim
#print(jj)
}
df_crime2[sample(1:total1,3),] %>% kable
| ID | crime | period | date | lag_crime | long | lat | period_1 | crime_1 | lag_crime_1 | period_2 | crime_2 | lag_crime_2 | period_3 | crime_3 | lag_crime_3 | period_4 | crime_4 | lag_crime_4 | period_5 | crime_5 | lag_crime_5 | period_6 | crime_6 | lag_crime_6 | period_7 | crime_7 | lag_crime_7 | period_8 | crime_8 | lag_crime_8 | period_9 | crime_9 | lag_crime_9 | period_10 | crime_10 | lag_crime_10 | period_11 | crime_11 | lag_crime_11 | period_12 | crime_12 | lag_crime_12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14154 | 93 | 0 | 130 | 2018-10-01 | 0.3333333 | -50.33526 | -26.53913 | 129 | 0 | 0.1666667 | 128 | 0 | 0.0000000 | 127 | 0 | 0.0 | 126 | 0 | 0.1666667 | 125 | 0 | 0.0 | 124 | 0 | 0.0 | 123 | 0 | 0.5000000 | 122 | 0 | 0.1666667 | 121 | 0 | 0.0000000 | 120 | 0 | 0.0000000 | 119 | 0 | 0.5 | 118 | 1 | 0.1666667 |
| 1047 | 66 | 1 | 10 | 2008-10-01 | 1.4000000 | -48.83526 | -27.05875 | 9 | 0 | 2.2000000 | 8 | 1 | 3.8000000 | 7 | 1 | 1.4 | 6 | 3 | 1.8000000 | 5 | 1 | 1.6 | 4 | 1 | 1.4 | 3 | 2 | 4.6000000 | 2 | 1 | 3.6000000 | 1 | 0 | 2.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 7067 | 91 | 0 | 65 | 2013-05-01 | 0.3333333 | -50.93526 | -26.53913 | 64 | 0 | 0.3333333 | 63 | 0 | 0.1666667 | 62 | 0 | 0.0 | 61 | 0 | 0.5000000 | 60 | 0 | 0.5 | 59 | 0 | 0.0 | 58 | 0 | 0.1666667 | 57 | 0 | 0.0000000 | 56 | 0 | 0.6666667 | 55 | 0 | 0.3333333 | 54 | 0 | 0.0 | 53 | 0 | 0.8333333 |
Note that there are some missing values in the beginning of the data frame. This happens because we created columns that are lagged values of the table.
missmap(df_crime2)
#drop missing lines
mis1 = apply(df_crime2,1,function(x){anyNA(x)})
#they should match
12*len_hex; sum(mis1)
## [1] 1308
## [1] 1308
We exclude all lines that have missing. Note that the missmap doesn’t have any missing.
df_crime3 = df_crime2[!mis1,]
df_crime3 %>% dim
## [1] 13952 43
missmap(df_crime3) #there are no more missing
anyNA(df_crime3)
## [1] FALSE
df_crime3 %>% names
## [1] "ID" "crime" "period" "date" "lag_crime"
## [6] "long" "lat" "period_1" "crime_1" "lag_crime_1"
## [11] "period_2" "crime_2" "lag_crime_2" "period_3" "crime_3"
## [16] "lag_crime_3" "period_4" "crime_4" "lag_crime_4" "period_5"
## [21] "crime_5" "lag_crime_5" "period_6" "crime_6" "lag_crime_6"
## [26] "period_7" "crime_7" "lag_crime_7" "period_8" "crime_8"
## [31] "lag_crime_8" "period_9" "crime_9" "lag_crime_9" "period_10"
## [36] "crime_10" "lag_crime_10" "period_11" "crime_11" "lag_crime_11"
## [41] "period_12" "crime_12" "lag_crime_12"
#tree models --------------------------------------------
#ideas
#one of the advantages of using a random forest
#algorithm is that we do not assume any form of
#distribution compared for example to the log gaussian
#cox process which assumes a Poisson distribution for
#the point process.
#for our dataset we have 14000 observations
#we have around 100 regions and time is in months
#for future work, we could change
#more, or less, regions. a larger period.
#trimester, or semester, for example.
#instead of using hexagons, we could just use the municipality
#borders as polygons. It makes much more sense.
#there are 293 municipalities in Santa Catarina.
#there will be many more rows in the table.
df_crime3[sample(1:10000,3),] %>% kable
| ID | crime | period | date | lag_crime | long | lat | period_1 | crime_1 | lag_crime_1 | period_2 | crime_2 | lag_crime_2 | period_3 | crime_3 | lag_crime_3 | period_4 | crime_4 | lag_crime_4 | period_5 | crime_5 | lag_crime_5 | period_6 | crime_6 | lag_crime_6 | period_7 | crime_7 | lag_crime_7 | period_8 | crime_8 | lag_crime_8 | period_9 | crime_9 | lag_crime_9 | period_10 | crime_10 | lag_crime_10 | period_11 | crime_11 | lag_crime_11 | period_12 | crime_12 | lag_crime_12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4312 | 61 | 0 | 40 | 2011-04-01 | 0.3333333 | -50.33526 | -27.05875 | 39 | 0 | 0.1666667 | 38 | 0 | 0.1666667 | 37 | 0 | 0.0000000 | 36 | 0 | 0.0000000 | 35 | 0 | 0.3333333 | 34 | 2 | 0.1666667 | 33 | 0 | 0.0000000 | 32 | 0 | 0.0000000 | 31 | 0 | 0.1666667 | 30 | 1 | 0.0000000 | 29 | 0 | 0.1666667 | 28 | 0 | 0.1666667 |
| 4515 | 46 | 0 | 42 | 2011-06-01 | 0.0000000 | -49.58526 | -27.31856 | 41 | 0 | 0.6666667 | 40 | 2 | 0.1666667 | 39 | 1 | 0.0000000 | 38 | 0 | 0.0000000 | 37 | 0 | 0.0000000 | 36 | 0 | 0.0000000 | 35 | 0 | 0.1666667 | 34 | 1 | 0.0000000 | 33 | 0 | 0.3333333 | 32 | 0 | 0.5000000 | 31 | 0 | 0.0000000 | 30 | 0 | 0.1666667 |
| 9297 | 32 | 1 | 86 | 2015-02-01 | 0.1666667 | -50.03526 | -27.57836 | 85 | 0 | 0.5000000 | 84 | 0 | 0.6666667 | 83 | 0 | 0.1666667 | 82 | 0 | 0.6666667 | 81 | 0 | 0.0000000 | 80 | 0 | 0.1666667 | 79 | 0 | 0.0000000 | 78 | 0 | 0.3333333 | 77 | 1 | 0.5000000 | 76 | 0 | 0.1666667 | 75 | 0 | 1.0000000 | 74 | 0 | 0.3333333 |
#for a first analysis we'll use only a subset of variables
data = df_crime3[, c("crime",
"crime_1","lag_crime_1",
"crime_2","lag_crime_2",
"crime_3","lag_crime_3",
"crime_4","lag_crime_4",
"crime_5","lag_crime_5",
"crime_6","lag_crime_6",
"crime_7","lag_crime_7",
"crime_8","lag_crime_8",
"crime_9","lag_crime_9",
"crime_10","lag_crime_10",
"crime_11","lag_crime_11",
"crime_12","lag_crime_12",
"long", "lat")]
hist(data$crime, breaks=50)
This number is important for the division of training and test datasets.
12*109 #one year of forecasting
## [1] 1308
len_hex #number of hexagons in SC
## [1] 109
len_hex*12
## [1] 1308
set.seed(1)
nrow = nrow(data)
train = 1:(nrow-len_hex*12)
data %>% names %>% kable
| x |
|---|
| crime |
| crime_1 |
| lag_crime_1 |
| crime_2 |
| lag_crime_2 |
| crime_3 |
| lag_crime_3 |
| crime_4 |
| lag_crime_4 |
| crime_5 |
| lag_crime_5 |
| crime_6 |
| lag_crime_6 |
| crime_7 |
| lag_crime_7 |
| crime_8 |
| lag_crime_8 |
| crime_9 |
| lag_crime_9 |
| crime_10 |
| lag_crime_10 |
| crime_11 |
| lag_crime_11 |
| crime_12 |
| lag_crime_12 |
| long |
| lat |
data %>% names %>% length #27 variables in data frame
## [1] 27
data %>% names %>% length %>% sqrt() #input in mtry
## [1] 5.196152
Create a dataset for training and test. We use a spatiotemporal model, so it is not a cross-section. Hence the division between training and test datasets has a temporal aspect.
trainX = data[train,]
trainY = data$crime[train]
testX = data[-train,]
testY = data$crime[-train]
We run the random forest model. Note that we use 100 trees in boostraping.
m1 = randomForest(crime~.,trainX,
mtry=5,importance=TRUE, ntree=100)
m1
##
## Call:
## randomForest(formula = crime ~ ., data = trainX, mtry = 5, importance = TRUE, ntree = 100)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.8480476
## % Var explained: 72.58
This is the MSE for training set.
yh1 = predict(m1,trainX)
mean((trainY - yh1)^2)
## [1] 0.1773787
This is the MSE for test set.
yh = predict(m1,testX)
mean((testY - yh)^2)
## [1] 0.6976304
plot(yh, testY)
abline(0,1)
#importance(m1)
varImpPlot(m1)
This is an importance plot. This plot shows which variables are more important. lag_crime_6,lag_crime_9,lag_crime_12,lag_crime_12 are the most relevant. Which is strange. Those variables are the spatially lagged values for 6, 9 and 12 months ago. I expected that the immediate temporal lag would be the most important variable to explain homicide numbers.
Here we plot maps showing the prediction and the ture values.
#plot of predict for 180901 -------------------------
#prepare a data frame for ploting of results
#predict Data Frame
yh[1:4] #vector of predictions
## 13953 13954 13955 13956
## 0.5888333 0.1826227 0.6178333 0.8923333
Here we plot residuals for some predictions.
12 #number of intervals to choose
## [1] 12
len_hex #number of hexagons in SC
## [1] 109
We plot the prediction for the first month of the test set.
period1 = 1
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1)) #interval to extract from yh
preDf = yh[inter1]
#they should be the same as
preDf %>% length; len_hex
## [1] 109
## [1] 109
#plot prediction and residuals
#present prediction and true values
#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE
#build map
spdf = list_poly[[1]][,"crime"]
spdf %>% class #109 regions
v = as.numeric(preDf) #include variable of interest
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) +
tm_raster("crime", style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
n = 4, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title= paste0("predict ",timeid[period1+(140-12)]),
title.size=.8,
title.position = c(0.25,.3))
Here we prepare data for plot of actual value of homicides.
#plot of actual for 180901
period1
## [1] 1
#build map
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v = as.numeric(spdf$crime)
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
tm2 = tm_shape(raster_idw) +
tm_raster("crime", style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
n = 4, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title= paste0("true ",timeid[period1+(140-12)]),
title.size=.8,
title.position = c(0.25,.3))
current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm1,tm2)
tmap_mode(current.mode)
tm3
tmap_save(tm = tm3,
filename = paste0("crime_","comp1",".jpg"),
width = 3, height = 3)
#plot of predict for 190601 -------------------------
period1 = 10
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1]
#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE
#build map
spdf = list_poly[[1]][,"crime"]
spdf %>% class #109 regions
v = as.numeric(preDf) #include variable of interest
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
tm1 = tm_shape(raster_idw) +
tm_raster("crime", style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
n = 4, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title= paste0("predict ",timeid[period1+(140-12)]),
title.size=.8,
title.position = c(0.25,.3))
#build map
period1
## [1] 10
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v = as.numeric(spdf$crime)
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm2 = tm_shape(raster_idw) +
tm_raster("crime", style = "fixed", breaks = c(0,0.5,1,2,3,4,5,6),
n = 4, palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .5,
legend.position = c("left","bottom"),
title= paste0("true ",timeid[period1+(140-12)]),
title.size=.8,
title.position = c(0.25,.3))
current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm1,tm2)
tmap_mode(current.mode)
tm3
tmap_save(tm = tm3,
filename = paste0("crime_","comp2",".jpg"),
width = 3, height = 3)
#residuals maps --------------------------------------
#create raster to plot idw
pol = as(munic_dissolve,"SpatialPolygons")
grid = spsample(pol, type = 'regular', n = 100000)
gridded(grid) = TRUE
period1 = 1
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1]
period1
## [1] 1
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v1 = as.numeric(spdf$crime) #v1 is the true crime
v = v1 - preDf #preDf is the predicted value of crime
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm1 = tm_shape(raster_idw) +
tm_raster("crime", style='quantile',
n=7 ,palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .3,
legend.position = c("left","bottom"),
title= paste0("residuals ",timeid[period1+(140-12)]),
title.size=.4,
title.position = c(0.25,.3))
moran.test(v,W)
##
## Moran I test under randomisation
##
## data: v
## weights: W
##
## Moran I statistic standard deviate = -1.5692, p-value = 0.9417
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.101762910 -0.009259259 0.003474902
#for period 1 of prediction, there is no spatial correlation in the residuals
#now, for period 10
period1 = 10
init1 = len_hex*(period1-1)+1
inter1 = (init1:(init1+len_hex-1))
preDf = yh[inter1]
ii = period1+(140-12)
spdf = list_poly[[ii]][,"crime"]
v1 = as.numeric(spdf$crime) #v1 is the true crime
v = v1 - preDf #preDf is the predicted value of crime
df1 = data.frame(v)
coordinates(df1) = coordinates(spdf)
proj4string(df1) = proj4string(spdf)
names(df1@data) = c("var1")
idw = idw(var1 ~ 1, df1, grid) #build raster with idw
## [inverse distance weighted interpolation]
idw.output = as.data.frame(idw)
names(idw.output)[1:3] = c("long", "lat", "crime")
spg = idw.output
coordinates(spg) = ~ long + lat
gridded(spg) = TRUE
raster_idw = raster(spg)
projection(raster_idw) = projection(spdf)
# map with munic borders
tm2 = tm_shape(raster_idw) +
tm_raster("crime", style='quantile',
n=7 ,palette = "Spectral", legend.show = TRUE)+
tm_shape(munic)+
tm_borders(alpha=.2, col = "black", lwd=.1)+
tm_layout(legend.text.size = .3,
legend.position = c("left","bottom"),
title= paste0("residuals ",timeid[period1+(140-12)]),
title.size=.4,
title.position = c(0.25,.3))
moran.test(v,W)
##
## Moran I test under randomisation
##
## data: v
## weights: W
##
## Moran I statistic standard deviate = -1.9558, p-value = 0.9748
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.118679340 -0.009259259 0.003129925
#for period 10 of prediction, there is no spatial correlation in the residuals
current.mode = tmap_mode("plot")
tm3 = tmap_arrange(tm1,tm2)
tmap_mode(current.mode)
tm3
tmap_save(tm = tm3,
filename = paste0("crime_","res",".jpg"),
width = 3, height = 3)
# end -------------------------------------
Here we try to predict number of homicides in each polygon. For another article we can predict the number of homicide in each municipality. We can incorporate data about population, GDP per capita and inequality. We could find other crime variables. We can study larger periods, for example, trimester or semester.
Given that the data have coordinates we can study inside city occurances. Model phenomena in the street level.
It is relevant to find a way to interpret how variables (features) affect our variable of interest, \(y\). In a future article I’ll build a Shap like method to intrepret outcomes of black-box like models.